Robust PDF Document Conversion using Recurrent Neural Networks

Authors

Abstract

The number of published PDF documents in both the academic and commercial world has increased exponentially in recent decades. There is a growing need to make their rich content discoverable to information retrieval tools. Achieving high-quality semantic searches demands that a document's structural components such as title, section headers, paragraphs, (nested) lists, tables and figures (including their captions) are properly identified. Unfortunately, the PDF format is known to not conserve such structural information because it simply represents a document as a stream of low-level printing commands, in which one or more characters are placed in a bounding box with a particular styling. In this paper, we present a novel approach to document structure recovery in PDF using recurrent neural networks to process the low-level PDF data representation directly, instead of relying on a visual re-interpretation of the rendered PDF page, as has been proposed in previous literature. We demonstrate how a sequence of PDF printing commands can be used as input into a neural network and how the network can learn to classify each printing command according to its structural function in the page. This approach has three advantages: First, it can distinguish among more fine-grained labels (typically 10-20 labels as opposed to 1-5 with visual methods), which results in a more accurate and detailed document structure resolution. Second, it can take into account the text flow across pages more naturally compared to visual methods because it can concatenate the printing commands of sequential pages. Last, our proposed method needs less memory and is computationally less expensive than visual methods. This allows us to deploy such models in production environments at a much lower cost. Through extensive architectural search in combination with advanced feature engineering, we were able to implement a model that yields a weighted average F1 score of 97% across 17 distinct structural labels. The best model we achieved is currently served in production environments on our Corpus Conversion Service (CCS), which was presented at KDD18. This model enhances the capabilities of CCS significantly, as it eliminates the need for human annotated label ground-truth for every unseen document layout. This proved particularly useful when applied to a huge corpus of PDF articles related to COVID-19.
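
As an illustration of the headline metric, the following is a minimal sketch of a weighted average F1 score computed over per-command labels. It assumes the common support-weighted definition (each label's F1 is weighted by how often that label occurs in the ground truth); the abstract does not state which weighting the authors used. The function name `weighted_f1` and the toy label sequences are hypothetical and are not taken from the paper.

```python
from collections import Counter


def weighted_f1(gold, pred):
    """Support-weighted average of per-label F1 scores (assumed definition)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one labelled item is required")

    support = Counter(gold)        # how often each label occurs in the ground truth
    predicted = Counter(pred)      # how often each label was predicted
    true_pos = Counter(g for g, p in zip(gold, pred) if g == p)

    total = 0.0
    for label, n in support.items():
        precision = true_pos[label] / predicted[label] if predicted[label] else 0.0
        recall = true_pos[label] / n
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n * f1            # weight each label's F1 by its support
    return total / len(gold)


# Toy example: one structural label per printing command (hypothetical data).
gold = ["title", "paragraph", "paragraph", "paragraph", "caption", "table"]
pred = ["title", "paragraph", "paragraph", "caption", "caption", "table"]
score = weighted_f1(gold, pred)
print(round(score, 3))  # about 0.844
```

In this toy sequence a single paragraph command mislabelled as a caption lowers both the paragraph F1 (to 0.8) and the caption F1 (to about 0.667), and the paragraph label contributes most to the average because it has the largest support.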

Similar resources

Rodbar dam slope stability analysis using neural networks

In this study, an artificial neural network is presented for predicting the safety coefficient and the critical factor of safety of non-homogeneous earth dams, taking into account the effect of the earthquake inertia force. The model inputs include the dam height and upstream slope angle, the earthquake coefficient, the water height, and the strength parameters of the core and shell; its outputs include the safety coefficient. The most important parameter in slope stability analysis is obtaining the factor of safety. In this study ...

Solving Linear Semi-Infinite Programming Problems Using Recurrent Neural Networks

The linear semi-infinite programming problem belongs to an important class of optimization problems that deal with infinitely many constraints. In this paper, to solve this problem, we combine a discretization method and a neural network method. By a simple discretization of the infinite constraints, we convert the linear semi-infinite programming problem into a linear programming problem. Then, we use...

Document classification and recurrent neural networks

The paper describes an automatic document classification system called NeuroClass, developed for the Air Transportation Field of Transport Canada. NeuroClass is a working classification tool for natural language text, based on recurrent neural network technology. In laboratory tests, it outperformed prototypes developed with other neural network paradigms.

Efficient Short-Term Electricity Load Forecasting Using Recurrent Neural Networks

Short-term load forecasting (STLF) plays an important role in the economic and reliable operation of power systems. Electric load demand has a complex profile with many multivariable and nonlinear dependencies. In this study, a recurrent neural network (RNN) architecture is presented for STLF. The proposed model is capable of forecasting the next 24-hour load profile. The main feature in this network is ...

Multitask learning in connectionist robust ASR using recurrent neural networks

The use of prior knowledge in machine learning techniques has been proved to give better generalisation performance for unseen data. However, this idea has not been investigated so far for robust ASR. Training several related tasks simultaneously is also called multitask learning (MTL): the extra tasks effectively incorporate prior knowledge. In this work we present an application of MTL in rob...

Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2021

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v35i17.17777